EXTENDING STREAMFLOW DATA


W. B. Langbein and C. H. Hardison,
Associate Members, ASCE 


ABSTRACT 


Inadequate length and number of streamflow records are common deficien- 
cies in the design of water projects. A solution to the problem of providing 
more coverage may be approached by applying the recognized correlation 
between the discharges of nearby streams, similarly situated. Correlations 
developed by graphic plots of contemporaneous monthly discharges generally
confirm those based on daily data. Graphs of best fit are drawn to an “engineers”
line rather than a least-squares regression. Correlation coefficients
(r) are nearly always greater than 0.8; theory indicates that extending data 
by correlation is profitable wherever r exceeds 0.44 and one record is at 
least 25% longer than the short record. Correlative estimates obtained by 
the methods described contain the variability, group distribution, and other 
significant characteristics of an actual record; and therefore provide useful 
information on water resources design and appraisal. As demands for in- 
formation mount, an increasing part of the stream-gaging program may be 
devoted to the operation of secondary or roving stations that are satellite to 
a firm network of long-term base stations. 


INTRODUCTION 


For some years the basic-data services have contended with a demand for 
information exceeding their facilities for providing it. In the stream-gaging 
field alone, the several federal agencies have established a goal of about 
5,800 additional gaging stations, which would about double the present network
(FIARBC 1950). And this list does not even consider the as yet uncodified
needs of State and local governments or those of the general public, which 
must also be met by the basic-data services. These needs when added togeth- 
er loudly bespeak an impelling and widely recognized need for the stream- 
gaging network to keep pace with and to lead the growing water development. 
With the recent increase in interest in supplemental irrigation in the East 
and in upstream flood control and stream pollution throughout the country, it
is becoming apparent that the number of sites where streamflow information 
is needed will be greater than can ever be gaged at any one time unless there 
is a large unexpected increase in the amount of funds available. 

Because of the chronic lag in collecting adequate records, there are few 
hydrologic projects for which streamflow data are complete in all desirable 
respects. A common deficiency is that the available record, if there is one, 
is short, and information is needed as to the flow for a more representative
period of time. The problem is to infer the long-term characteristics of 
river flow from the short record. Moreover, one can readily see that if 
sound methods for drawing such inferences can be devised, the number of 
gaged sites can be greatly increased by shorter periods of operation and the 
areal coverage of streamflow data expanded. 


Kind of Data Needed 


Runoff data may be required for use in one of two rather distinct ways: 


1. The actual day by day runoff may be required, as in the operation of 
developed projects and in the adjudication of water rights and damages. As 
the rivers of the country become more fully developed the number of gaging 
stations required for water management will grow. 

2. Information may be needed on the variability of sequence of volumes 
and rates of flow without necessary regard to their calendar dates. Develop- 
ments of water resources are usually based on estimates of the nature of the 
future stream regimen. Past records are used as a sample of what may be 
expected in the future. The essential purpose of this paper is to discuss ways 
by which short records can be used effectively to draw such inferences. The 
title of this paper is ambiguous purposely in that if data from a short record 
of stream flow can be “extended” lengthwise, a more “extended” areal coverage
will be feasible. The Geological Survey has given increased attention to
ways and means by which it might be possible to obtain some information on 
a large number of streams not now covered by stream-gaging stations. 


The problem of providing more coverage is being approached by taking 
advantage of the recognized correlation between the hydrographs of nearby 
streams similarly situated. There is relative correspondence in the day to 
day, month to month, and year to year variation in the flow of nearby streams, 
that has been fairly well established in hydrologic experience, but has yet to 
be exploited fully, either by users of streamflow records, or in the design 
of stream-gaging networks. Broad coverage of streamflow information can 
be provided through operation and frequent relocation of short-term gaging 
stations and the operation of partial-record gaging stations where only frag- 
mentary data would be collected. These records can be effectively used only 
through correlation with long-term base stream-gaging stations. General 
coverage hinges on an adequate network of base gaging stations and the de- 
velopment of techniques for transferring information from the base station 
to the satellite short-term sites. The base stations would provide the sam- 
pling of the time sequence of events, whereas the short-term and partial- 
record stations would provide the extensive areal sampling that is becoming 
increasingly necessary. 

There are two classes of information that hydrologists look for in a 
stream-gaging record. The first is that kind of information that represents 
the more or less relatively fixed or invariable characteristics. These are, 
for example, denoted by such things as unit hydrographs, channel-storage 
routing coefficients, and stage-discharge curves. The other kind of information
is that which is contained in the time-variable characteristics; these are,
for example, mean rates and volumes, duration curves, storage-draft curves, 
flood-frequency graphs, and frequency of low flows. 

It appears from what studies we have made, that the length of record will 
be controlled in most cases by the necessity to evaluate the time-variable 
factors. Ordinarily, if a satellite station is operated for a period sufficiently
long to derive a satisfactory correlation with a base station, the record will
also contain sufficient information from which to derive the unit hydrographs 
and other nominally fixed hydrologic factors. 


How Correlations Can be Made 


Although the principles involved in correlating short-term records and 
partial records with a long-term stream-gaging station are easy to grasp, 
the details of the techniques are far from completely developed. The correla- 
tion should provide unbiased analyses of hydrologic relationships and com- 
mon grounds for evaluating results. Statistical reasoning and methods should 
be used, always keeping in mind that statistics should be the tool of hydrology 
and not vice versa. We have found that the hydrologic aspects are most clear- 
ly kept in view when graphical statistical correlations are used in lieu of 
analytical methods generally described in most textbooks. 

In order that simple statistical tests may be trustworthy, the data used 
should have a normal distribution. The fact that the mean of streamflow data 
is almost always higher than the median indicates that these data have a pos- 
itively skewed distribution. Fortunately much of this skewness can be re- 
moved by using the logarithm of the discharge instead of the discharge itself. 
Therefore, streamflow correlations are made using the logarithms of the 
discharge, i. e., a logarithmic chart (base 10 implied) is used for the dis- 
charge values in the graphical analysis. A logarithmic scale will generally 
normalize the discharge data themselves, as well as the deviations from a 
graph. Under special circumstances other types of transformations may be 
used to normalize the discharge data. For example the logarithmic trans- 
formation fails whenever any of the data equals zero. In such cases a cube- 
root transformation (Stidd, 1953) might be found useful. 
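
The effect of these transformations can be checked numerically. The sketch below compares the skewness of a set of discharges before and after the logarithmic and cube-root transformations. The discharges are invented for illustration and are not taken from any record cited here; the helper name `skewness` is likewise an assumption of this sketch.

```python
import math

def skewness(values):
    """Moment coefficient of skewness (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical monthly mean discharges in cfs (illustrative only).
flows = [12, 15, 18, 22, 27, 35, 48, 70, 110, 190, 400, 950]

raw_skew = skewness(flows)
log_skew = skewness([math.log10(q) for q in flows])
cbrt_skew = skewness([q ** (1.0 / 3.0) for q in flows])

# The cube root is defined at zero discharge, where the logarithm is not.
flows_with_zero = [0.0] + flows
cbrt_with_zero = [q ** (1.0 / 3.0) for q in flows_with_zero]

print(f"skewness: raw={raw_skew:.2f}  log10={log_skew:.2f}  cube root={cbrt_skew:.2f}")
```

For data of this kind the logarithm removes more of the skewness than the cube root does, which is one reason to reserve the cube root for records that contain zero flows.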

What time units are most suitable for correlative estimates? Average 
discharge for a 10 or 20-year period does not tell us of the variations in an- 
nual discharge; annual discharges do not tell us of the distribution throughout 
the year; but monthly discharge gives a fairly complete picture of the dis- 
charge pattern. The advantages of correlating daily discharge would largely 
be offset by undue labor and time. Ordinarily a month is the shortest period 
of time for which compilations of hydrologic data are available. Although a 
month is hydrologically arbitrary (a storm period may be divided between
two calendar months), most potential users of correlative estimates of discharge
would probably agree that reasonable estimates of monthly discharge
are all that are needed in the matter of historic flows, considering that sepa- 
rate diagrams can be prepared for correlation of such salient features as 
peak discharges. 

The example that follows is intended to demonstrate the application of the 
requirements and suggestions set forth above. The easiest way to determine 
the correlation between the flow of two streams is to plot the discharge at 
one station against the discharge at the other station as shown in figure 1, 
which shows a plotting of concurrent monthly discharges of Sepulga River 
near McKenzie, Alabama, against Murder Creek near Evergreen, Alabama, 
for the 60-month period, October 1940 through September 1945. These 
streams, 20 miles apart, are not affected by runoff from snow nor by any other 
factor that produces a seasonal effect on the correlation. Somers (1954) 
demonstrates a method applicable to snow-fed streams used by the Geological 
Survey in studies of the stream-gaging program in the Uinta and Wasatch 
mountains of Utah. 




For streams where the correlation is not affected by the season of the 
year we may plot all daily discharges, selected low-flow daily discharges 
(shown in center part of figure 2), discharges for equal percent duration 
(shown in left part of figure 2), discharges for equal recurrence intervals of 
annual low flow (shown in right part of figure 2), or monthly discharges 
(shown in figure 1) depending on how the base data has been compiled, the 
amount of time available, and the accuracy desired, but the position of the 
mean curve will be much the same except possibly at the extremes. As the
log10 of the discharge is much more normally distributed than is the discharge
itself (Hazen, 1930; Lane and Lei, 1950), this plotting should be made
on log-log graph paper. Curves developed in different ways as shown in 
figures 1 and 2 are essentially the same. More accurate definitions can be 
obtained by using two or more of the types of data shown. 

There is no clear reason why mean curves based on these different pres- 
entations of the discharges for a concurrent period should be coincident. 
Strictly these can be coincident only when data are proportional throughout 
the range (i.e., graph approximates a 45° slope on logarithmic chart) or
when the data have a limited range in any period. For example, monthly dis- 
charges are the mean of many diverse values and in fact the mean value it- 
self may be missing from among the daily discharges that compose the month- 
ly mean. On the other hand duration curves are based on ordered data pre- 
pared without regard to the different time intervals in which they may have 
occurred. Under the two conditions cited we may overlook the distinction, 
but where the discharges at two stations have different log standard devia- 
tions, and where there is considerable range in discharges within each month 
then we may expect a graph based on a plot of concurrent monthly discharges 
to be different from that defined by a plot based on the discharges correspond- 
ing to equal percentages on a duration-curve of daily discharges. There 
might be reasonable correspondence in the region of the means but there may 
be pronounced differences at the extremes. The proper graph to use will 
depend on the use to be made of the curve, but our experience indicates fig- 
ure 2 to be fairly typical of the correspondence at the low ends when the range 
in discharges during a month is apt to be limited. In any case a graph drawn 
to average a plot of concurrent monthly means will correspond with a dura- 
tion curve of monthly means. In fact one method of deriving the mean graph 
is to order the monthly means for the two stations, and to plot the two highest 
means, second highest and so forth. Whatever method is used concurrent 
months and minimum 7-day discharge each year should also be plotted in 
order to display the degree of correlation. 
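
The ordering method mentioned above can be sketched as follows. The function name `rank_paired_points` and the discharges are hypothetical, chosen only to show the pairing of highest with highest, second highest with second highest, and so forth.

```python
def rank_paired_points(base_flows, short_flows):
    """Pair ordered monthly means: highest with highest, and so on down."""
    if len(base_flows) != len(short_flows):
        raise ValueError("records must cover the same concurrent months")
    return list(zip(sorted(base_flows, reverse=True),
                    sorted(short_flows, reverse=True)))

# Hypothetical concurrent monthly means in cfs (illustrative only).
base = [120, 45, 300, 80, 60, 210]
short = [30, 9, 95, 22, 12, 50]

points = rank_paired_points(base, short)
for x, y in points:
    print(x, y)
```

Points paired in this way define the mean graph only; they discard the concurrence of the months, which is why the concurrent values are still plotted to display the scatter.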

In order to appraise the relative accuracy of various correlation possibili- 
ties it is necessary to evaluate the scatter in the plotting of concurrent data. 
One of the most practical ways to do this is to correlate monthly discharge 
only for a concurrent 5-year period. Suggested steps in such a correlation 
are described as follows: 


1. Plot monthly discharge at Station A against that at Station B on log-log 
paper with the independent variable on the x axis. (In figure 1 we wish to 
see how well Murder Creek could be estimated from Sepulga River). 

2. As shown on figure 1, divide the total spread of the points in the x and y 
directions into 5 to 10 equal parts and draw lines (dotted in figure 1) to divide 
the points into 5 to 10 slices in both directions. There should be at least 3 
points in each end slice. 

3. For each slice (10 vertical and 5 horizontal in figure 1) determine 
graphically the median of the points in both x and y directions. For an odd 
number of points give the average of the two points adjacent to the middle
point equal weight with the middle point; averaging to be in terms of scaled 
distance of the graph. 

4. Draw a smooth curve based on these median points. Make the upper end 
of curve 45 degrees (parallel with equal yield line, i. e., drainage area ratio). 
The lower end should probably be another straight line with the two lines con- 
nected by a curve. For comparison, figure 1 also shows a line corresponding 
to the ratio of the drainage areas, which shows how misleading the simple 
application of drainage area ratio for estimates of flows at an ungaged site 
can be, especially at low rates of discharge. 

5. Draw a second curve equidistant (in a vertical direction) from this curve 
such that 1/6 of the points will lie below the new curve. Draw a third curve such
that 1/6 of the points will lie above it. Thus, two-thirds of the 60 points will lie
between the top and bottom curves. 

6. Estimate the standard error of estimate in log units by determining 
one-half the log-unit distance between these two outer curves. One cycle is 
one log-unit and log-units less than 1.00 indicate the proportional part of a 
cycle. If the log cycle does not fit an engineer’s scale, the log-units may be 
computed by taking the log of the ratio of a discharge on the upper curve to 
the discharge directly below it on the lower curve, since log a - log b = log (a/b).

7. Convert the standard error of estimate to ratio and then to percent. 
Thus, 0.10 log units corresponds to a ratio of 1.26, or +26 percent; using its
colog, 0.90, the antilog is 0.795, or -20.5 percent. The standard error of 0.086 log units shown
by the correlation in figure 1 is equivalent to +22 percent and -18 percent. 

By plotting monthly discharge to a logarithmic scale and computing the deviation
from the relationship line in log10 units, the resulting standard
error of estimate will be in log10 units, and this corresponds to the same percentage
error at all points on the curve. Reporting errors in percentage is
far more meaningful to a hydraulic engineer than the standard error of esti- 
mate expressed in terms of cubic feet per second, as stream-gaging practice 
generally maintains consistent percentage accuracy throughout the range in 
discharge. A standard error of estimate of 100 cfs might be only 10 percent 
of the mean but could very well be 100 percent during period of low discharge. 
A standard error of 0.05 log10 units corresponds to +12% and -11% throughout
the range of discharges.

8. If the errors measured in log units are normally distributed and if the 
mean line is correctly located, two-thirds of all occurrences past and future 
would fall within one standard error of estimate of the mean curve, 95 per- 
cent of points would fall within two standard errors, and practically all would 
fall within three standard errors. 

9. Divide the standard error in log units by the square root of the number of
degrees of freedom to obtain the standard error of the curve. (In this example,
60 points less an estimated 6 lost degrees of freedom.) Thus, if the standard
error is 0.10 log units, the standard error of the curve is 0.10/√(60 - 6), or
0.0136. This is
the standard error of the curve at its mid-point. The chances are two to one 
(two chances out of three) that at its mid-point, the curve is within one stand- 
ard error of the true curve of relation, and 20 to 1 (95 chances out of 100) 
that it is within two standard errors. Of course, the extremes of the graph 
are less well defined and statistics textbooks show that at a distance of two 
standard deviations from the mean of the dependent variable, the standard
error of the curve is 2-1/4 times that at the mid-point. For comparative purposes
this fact is of little importance for it is equally true of all correlations
and of the extremes of the available record. The further we go toward the 
extremes of the available record the more we need correlation with long-term
base stations and the more willing we will be to use them even in the
face of a larger standard error in the position of the curve at the extremes. 
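
Steps 5 through 7 and step 9 are graphical, but their arithmetic can be sketched numerically. In the sketch below the residuals are synthetic (evenly spaced normal quantiles scaled to 0.086 log units), and the function names are assumptions of this illustration. Placing each outer curve midway between the two points that straddle it is one simple convention, not necessarily the one a draftsman would follow.

```python
import math
from statistics import NormalDist

def band_standard_error(log_residuals):
    """Steps 5 and 6: half the log-unit distance between two curves that
    leave 1/6 of the points below the lower and 1/6 above the upper."""
    ordered = sorted(log_residuals)
    n = len(ordered)
    k = n // 6  # number of points left outside each curve
    if k < 1:
        raise ValueError("need at least 6 points")
    lower = 0.5 * (ordered[k - 1] + ordered[k])
    upper = 0.5 * (ordered[n - k - 1] + ordered[n - k])
    return 0.5 * (upper - lower)

def log_units_to_percent(se_log):
    """Step 7: plus and minus percentage equivalents of a log10 error."""
    plus = (10 ** se_log - 1.0) * 100.0
    minus = (10 ** (-se_log) - 1.0) * 100.0
    return plus, minus

def curve_standard_error(se_log, n_points, lost_dof):
    """Step 9: standard error of the curve at its mid-point."""
    return se_log / math.sqrt(n_points - lost_dof)

# Synthetic residuals about the mean curve, in log10 units (illustrative only).
residuals = [0.086 * NormalDist().inv_cdf((i + 0.5) / 60) for i in range(60)]

se = band_standard_error(residuals)
plus_pct, minus_pct = log_units_to_percent(0.10)
se_curve = curve_standard_error(0.10, 60, 6)

print(f"standard error from band: {se:.3f} log units")
print(f"0.10 log units: {plus_pct:+.1f} percent, {minus_pct:+.1f} percent")
print(f"standard error of curve at mid-point: {se_curve:.4f} log units")
```

Because the band is read from only the points near the one-sixth and five-sixths positions, it gives a slightly different figure than a computed root-mean-square deviation would; for comparative studies the difference is unimportant.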

The line of “best fit” described above corresponds to what is called an 
“engineers” line rather than a least-square line. This distinction is im- 
portant especially where coefficients of correlation are less than 0.90. The 
slope of the least-square line regresses toward the dependent variable (hence 
the term “line of regression”) by a factor equal to the coefficient of correla- 
tion. Thus the standard deviation of the estimates read from a line of regres- 
sion is equal to the coefficient of correlation times the standard deviation of 
the discharges at the base station. But if a tabulation is then made consisting 
in part of original record, and in part of correlative estimates for an extended 
period, there will be a distinct break in the variance in the two parts of the 
array. Although the regression line represents the most probable estimates 
that are associated with the base station, hydrologic conditions stipulate that 
the relationship between two stations be unique. This condition is met, as 
pointed out by Goodrich [1954], by minimizing the sum of the products of the 
departures Δx and Δy, a solution that leads to the simple statement that the
slope of the line of relationship is equal to the ratio of the standard deviations
of the two variables. The coefficient of correlation with relation to an unregressed
line is less than that relative to a least square line, but the difference
is small for values greater than r = 0.75.* An unregressed line, as is ordi- 
nary engineering practice, is drawn by the device just described, of averaging 
the points in horizontal and vertical strips. 
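
The distinction between the two lines can be illustrated with a small computation. The paired values below are hypothetical log10 discharges, and the function name `line_slopes` is an assumption of this sketch. It shows that estimates read from the least-square line have a standard deviation reduced by the factor r, whereas estimates from the unregressed line keep the full standard deviation of the dependent variable.

```python
import math
import statistics

def line_slopes(x, y):
    """Return r and the slopes of the least-square and unregressed lines."""
    sx = statistics.pstdev(x)
    sy = statistics.pstdev(y)
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    r = cov / (sx * sy)
    return r, r * sy / sx, sy / sx

# Hypothetical concurrent log10 discharges (illustrative only).
x = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.6, 2.9, 3.1, 3.3]  # base station
y = [0.9, 1.4, 1.3, 1.9, 1.6, 2.3, 2.0, 2.8, 2.6, 3.2]  # short-term station

r, slope_ls, slope_un = line_slopes(x, y)
mx, my = statistics.fmean(x), statistics.fmean(y)

est_ls = [my + slope_ls * (a - mx) for a in x]  # least-square estimates
est_un = [my + slope_un * (a - mx) for a in x]  # unregressed estimates

sd_ratio_ls = statistics.pstdev(est_ls) / statistics.pstdev(y)  # equals r
sd_ratio_un = statistics.pstdev(est_un) / statistics.pstdev(y)  # equals 1

# Correlation relative to the unregressed line, from its residual variance.
resid_var = statistics.pvariance([b - e for b, e in zip(y, est_un)])
r_unregressed = math.sqrt(1.0 - resid_var / statistics.pvariance(y))

print(f"r = {r:.3f}, unregressed r = {r_unregressed:.3f}")
print(f"sd ratio: least-square = {sd_ratio_ls:.3f}, unregressed = {sd_ratio_un:.3f}")
```

The last quantity agrees with the expression sqrt(2r - 1), which follows from the residual variance about the unregressed line being 2(1 - r) times the variance of the dependent variable.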

As mentioned earlier, correlations of snow-fed streams often show a sea- 
sonal effect. This effect, which results from variable rate of snow melt in 
the two basins, largely can be removed by graphic multiple correlation 
[ Somers, 1954]. The month of the year is used as the second independent 
variable but at times it may be found desirable to use other independent vari- 
ables such as relative rainfall on the two basins or discharge at another ad- 
jacent station to explain the scatter of the points on the original correlation. 
Hydrologists will be tempted to adjust for some outstanding departure from 
the graph, but the statistician would warn him always to make the same cor- 
rection throughout his analysis. A prime requirement of correlation analysis 
is systematic adherence to the technique selected. All too often the improve- 
ment made in the plotting of a few bad points may throw many other points 
farther away from the relationship line. Although it is desirable to seek improvement
in correlation, we must be prepared to acknowledge that wide departures
are as much a part of the correlation process as the correct predictions
themselves. 

Although the standard error of estimate is a good tool for comparative 
studies, the percentage error indicated by it does not tell the whole story.
For example, a large standard error of estimate for the estimated discharge 
of a station with a wide range in flow, say from 1 to 10,000 cfs, should not be 
as alarming as the same standard error for a station with a narrow range in 
flow, say from 10 to 100 cfs. The relative improvement in reliability by us- 
ing a correlation based on 5 years of record, as compared to operating the 
station for 5 years with no correlation with a long-term station, would be 
greater for the station with the greater range in flow. The standard deviation 
of the dependent variable as shown in figure 1 is an indication of the range,
and the correlation coefficient is an indication of the relative improvement.

*The coefficient of correlation relative to an unregressed line is √(2r - 1),
where r is the usual least-squares coefficient of correlation.


Practical limits: There are circumstances when extending a record does 
not yield results commensurate with the effort involved, for example, when
correlation is poor, or where the record at a secondary station is only slightly
shorter than that at a base station. In the first case the error introduced
in the correlation estimates is greater than the reduction in sample error
accomplished by the synthesized longer record; and in the second case, the 
refinement is not worth the effort. It is helpful to set up certain rules as 
guides to the conditions under which extensions are profitable. Creager and 
Justin (1950) devote several pages of their hydro-electric handbook to the 
extensions of streamflow data. They also give an approximate formula for 
permissible or limiting error in the correlation beyond which no benefit will 
be derived from the use of the correlation. However, any correlation for 
which the standard error of estimate is enough less than the standard devia- 
tion to assure us that the correlation is not spurious will provide some im- 
provement. But the real problem is to set up a rule that will assure a mini- 
mum improvement in terms of the quality of the correlation and the relative 
lengths of record at the two stations. The following equation provides a use- 
ful guide: 


$$\frac{N}{n} = \frac{r^{2}}{r^{2} - e}$$


where N is length of long-term record, n is length of short record, r is
correlation coefficient, and e is the relative reduction in variance of mean 
to be accomplished. Figure 3 shows a graph of this equation taking e as 
0.20; that is, assuming a 20 percent reduction in variance (about a 10 percent 
reduction in standard error of the mean) as the minimum goal. The graph 
shows that where a considerable extension is to be made a poor correlation 
(low coefficients of correlation) is acceptable; where the extension is short, 
correlations must be very good to make the effort worthwhile. The graph
shows, moreover, that correlation coefficients less than 0.44 will not meet 
the specifications set, regardless of the amount of extension. It also shows 
that the long record must be at least 1.25 times the short record to make the 
correlation useful, regardless of the degree of correlation. The graph on
figure 3 is quite general, and is equally applicable to extensions of monthly 
discharges, low discharges, or flood peaks. Since for most pairs of stations 
correlation coefficients are 0.80 or better, the general conclusion is that ex- 
tensions are worthwhile wherever the longer record is at least 25 percent 
longer than the shorter record. 
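
The limits quoted above can be checked with a short computation. The sketch assumes the guide equation has the form N/n = r^2/(r^2 - e), which is the form consistent with both stated limits (a minimum r of about 0.44 and a minimum length ratio of 1.25 when e is 0.20). The function name `min_length_ratio` is an assumption of this illustration.

```python
import math

def min_length_ratio(r, e=0.20):
    """Smallest ratio N/n of long to short record for which extension
    reduces the variance of the mean by the fraction e."""
    if r * r <= e:
        return math.inf  # no amount of extension meets the goal
    return r * r / (r * r - e)

for r in (0.44, 0.50, 0.60, 0.80, 0.90, 1.00):
    ratio = min_length_ratio(r)
    print(f"r = {r:.2f}  required N/n = {ratio:.2f}")
```

With r = 0.80 the required ratio is about 1.45, so the 25 percent figure in the general conclusion is best read as the lower bound approached only when correlation is very high.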


Degree of Correlation Obtained 


The large number of correlations that have been worked out show in 
general, that the standard error increases with distance between centers of 
drainage basins. As might be expected, low standard errors are obtained 
where the stations are on the same stream. Most of the error is contributed 
by summer storms, and correlation is poorest in those streams where most 
of the flow is produced by summer storms, as in the Great Plains and in the 
dry country generally. Considering the East and the mountain streams 
in the West (with seasonal variation removed), standard errors rarely exceed
0.01 log10 units per mile of separation. Coefficients of correlation
are generally greater than 0.90, and nearly always greater than 0.8. 




How Data are to be Used 


A discharge read from a graph of correlation may be identified as a 
“correlative estimate.” This term is proposed to indicate that such figures 
represent a likely value of discharge for any particular period—commonly a 
month—according to a specified method of analysis. Correlative estimates 
may be in error by amounts somewhat greater than those usual in ordinary 
stream gaging. If correlative estimates of discharge are to be used as a 
prediction of the magnitude and distribution of streamflow, any discharge 
that is estimated for some past period becomes merely a means to an end. 
The correlative estimates contain the variability, group distribution, and
other significant characteristics of an actual record and, therefore, provide
figures that are statistically if not hydrologically equally useful in water
resources design and appraisal. The standard error of the correlative esti- 
mates themselves is the standard error of estimate of the correlation, but 
when the correlative estimates are used in combination with the available 
record, it is the standard error of the mean curve that is important. In fact 
the resulting error in the mean (and other stream flow characteristics) com- 
puted from the combined record is the same as would be obtained by using 
the curve of relation to transfer the mean (or other salient streamflow parameter)
at the long-term station to the short-term station.

In many problems it may be possible to transfer information concerning 
the magnitude and frequency of low flows from a long-term station to a short- 
term station through the use of the curve of relation without synthesizing a 
table of discharges for the short-term station. Such estimates represent a 
discharge that has the same probability of occurrence (recurrence interval) 
as at the primary station. In such applications our major concern is how well 
defined is the curve of relationship and what errors are involved in the transfer
of frequency data. As stated above the standard error in the final result
is the same as if the streamflow record had been “extended” using correla- 
tive estimates—that is the square root of the sum of the variance at the pri- 
mary station and variance of the correlation graph. 
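
The combination of errors stated above amounts to adding variances. A minimal sketch follows, assuming both standard errors are expressed in the same units (log10 units here) and are independent; the numerical values and the function name `combined_standard_error` are hypothetical.

```python
import math

def combined_standard_error(se_primary, se_curve):
    """Square root of the sum of the variance at the primary station
    and the variance of the correlation graph."""
    return math.sqrt(se_primary ** 2 + se_curve ** 2)

# Hypothetical standard errors in log10 units (illustrative only).
se_total = combined_standard_error(0.03, 0.04)
print(f"combined standard error: {se_total:.3f} log units")
```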

To one accustomed to basing plans and designs on long-term streamflow 
records at the site, the suggested use of correlative estimates derived by the 
methods described may appear as a thoroughly undesirable substitute. Yet, 
if we consider that a valid record of the past is only a sample in time and 
that precision of measurement cannot compensate for the uncertainties that 
the future has in store, it can be stated with a fair degree of certainty that 
errors of correlation usually are small relative to the errors of sampling. 
Because coefficients of correlation are nearly always greater than 0.8, the 
error introduced by the correlation is always less than the decrease in sam- 
pling error accomplished by extension if the ratio of the length of the longer 
record to the shorter is over 1.25. Errors of correlation cannot be removed
but they can be reduced to any desired amount through length of operation, 
bearing in mind that the law of diminishing returns ultimately makes it ad- 
vantageous to relocate a station to an ungaged site and so to extend areal 
coverage. The development of sound and efficient methods for extending 
short-term streamflow data is a problem of growing importance to sound 
water development. 

Some 12,500 stream-gaging stations have been operated in the United 
States. Of these about 6,500 are now in current operation with a median 
length of record of 14 years. Of the 6,000 discontinued stations, there are 
many whose value may be augmented through correlation with nearby 
stations. As shown in figure 4 the stream-gaging network has steadily
grown through the years. Yet, the basic-data services have a continual uphill
struggle to bring a lagging stream-gaging program into balance with
accelerating water development programs. As the demands for information 
continue to mount, an increasing part of the stream-gaging program may 
necessarily be devoted to operations of secondary stations that are satellite 
to a network of base stations. As the program matures, the short-term satel- 
lite gaging station will become a more important source of information for 
the planning and designing engineer. 

Figure 4. Number of gaging stations in operation,
by Geological Survey since 1903.